## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Introduction: The red wine quality data set contains 1599 observations and 13 variables. The variables are a row index (X), 11 chemical characteristics of the wine, and a quality rank that ranges from 3 to 8.
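The output above is consistent with calls to `dim()`, `str()` and `summary()`. A minimal sketch of that inspection step, assuming a data frame built from only the first ten observations printed in the `str()` output (the name `wine_head` is illustrative, and the full file is not loaded here):

```r
# First ten observations of four columns, copied from the str() output above.
# `wine_head` is an illustrative name; the full data set has 1599 rows.
wine_head <- data.frame(
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_head)      # number of rows and columns
str(wine_head)      # type and first values of each column
summary(wine_head)  # min, quartiles, median, mean and max per column
```

Because only ten rows are used, the printed summaries will differ from the full-data values shown above.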
From the plot above, we can see that almost 41% of the wines have a quality rank of 5 and almost 39% have a rank of 6.
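The share of each quality rank can also be computed directly rather than read off a plot. A minimal sketch, assuming the quality column is available as a vector; here only the first ten ranks from the `str()` output are used, so the percentages will not match the full-data figures:

```r
# Quality ranks of the first ten observations, copied from the str() output.
# With the full data set, pass the whole quality column instead.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

counts <- table(quality)                      # observations per rank
shares <- round(100 * prop.table(counts), 1)  # percentage per rank
print(shares)
```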
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
High residual.sugar values appear to be unusual in this data set: the third quartile is 2.6 while the maximum is 15.5. Therefore, I limited the histogram's x-axis to 3, which hides the long right tail so that the displayed values look closer to a normal distribution.
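Restricting the plotted range changes only the view, not the data. A minimal base R sketch of that idea, assuming the first ten residual.sugar values from the `str()` output (the report's own plotting code is not shown, so `hist()` is a stand-in here):

```r
# First ten residual.sugar values, copied from the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Quartiles and maximum: a maximum far above the third quartile points to
# a few unusually sweet observations.
quantile(residual_sugar, probs = c(0.25, 0.5, 0.75, 1))

# Keep only the values inside the plotted range. The cut-off of 3 is the
# limit mentioned in the text; the original vector is left unchanged.
in_view <- residual_sugar[residual_sugar <= 3]
hist(in_view, breaks = 10,
     main = "residual.sugar (values up to 3)", xlab = "residual.sugar")
```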
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol histogram is skewed to the right with a peak near 9.5%, so the data is not normally distributed. The median is 10.2% and the mean is 10.42%. According to the source below, a higher alcohol level makes the wine taste dry, while wines with levels under 12.5% tend to taste sweet. Therefore, the relationship between alcohol and residual.sugar is explored in the next section (Bivariate Analysis). source: https://www.everwonderwine.com/blog/2017/1/14/is-there-a-relationship-between-a-wines-alcohol-level-and-its-sweetness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH is approximately normally distributed. The median and the mean are both about 3.31. A higher pH level means lower acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
density is approximately normally distributed. The median and the mean are both about 0.997.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The sulphates histogram is skewed to the right, so the data is not normally distributed. The median is 0.62 and the mean is 0.66.
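A mean that lies above the median is a quick, informal sign of right skew. A minimal sketch using the values from the `summary()` output above; `mean_minus_median` is a hypothetical helper, and this is a heuristic rather than a formal test:

```r
# Median and mean copied from the summary() output above.
sulphates_summary <- c(median = 0.62, mean = 0.6581)
alcohol_summary   <- c(median = 10.20, mean = 10.42)

# Positive difference: the mean is pulled upward by a long right tail.
# `mean_minus_median` is a hypothetical helper name.
mean_minus_median <- function(s) unname(s["mean"] - s["median"])

mean_minus_median(sulphates_summary)
mean_minus_median(alcohol_summary)
```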
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile.acidity histogram is almost normally distributed, with a median of 0.52 and a mean of 0.53.
1599 observations and 13 variables
## What is/are the main feature(s) of interest in your dataset?
Acidity, tannin, alcohol and sweetness are the main traits that affect red wine quality. source: https://winefolly.com/review/understanding-acidity-in-wine/
Therefore, my exploration focuses on the pH, alcohol, density and residual.sugar variables to see their effects on the quality variable.
https://www.quickanddirtytips.com/health-fitness/healthy-eating/myths-about-sulfites-and-wine
No
source: https://www.statmethods.net/graphs/scatterplot.html
Interesting correlations:
- residual.sugar and density are strongly correlated.
- quality is strongly correlated with alcohol, citric.acid, volatile.acidity and sulphates. The problem is that some of these variables are also correlated with each other:
  1. volatile.acidity and sulphates are correlated
  2. citric.acid is correlated with sulphates
- alcohol and pH are strongly correlated.

Not interesting correlations:
- As expected, free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated, but I will not explore this relationship since the first is part of the second.
- citric.acid is negatively correlated with pH and with volatile.acidity. Higher acidity means lower pH.
- fixed.acidity is negatively correlated with pH and with volatile.acidity.
- pH and chlorides are correlated.
- total.sulfur.dioxide and alcohol are strongly correlated. The reason could be that alcoholic fermentation produces sulfites.
- alcohol and density are negatively correlated. When the alcohol level increases, the wine density decreases.
- alcohol and chlorides are strongly correlated.
- density is positively correlated with citric.acid and with fixed.acidity.
- density is negatively correlated with pH.
- sulphates and chlorides are correlated.
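Pairwise correlations like those listed above can be computed with `cor()`. A minimal sketch, assuming Pearson correlation and using only the first ten observations from the `str()` output (the name `wine_head` is illustrative), so the coefficients will not match those of the full data set:

```r
# First ten observations of five columns, copied from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns, rounded for reading.
cor_matrix <- round(cor(wine_head, method = "pearson"), 2)
print(cor_matrix)

# Correlation of each variable with quality, strongest first by absolute value.
quality_cor <- cor_matrix[, "quality"]
quality_cor[order(abs(quality_cor), decreasing = TRUE)]
```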
As expected, above an alcohol level of 12.5%, residual.sugar decreases, since the sweetness is reduced and the taste becomes dry.
residual.sugar and density are strongly correlated. The reason might be that sweeter wine has a higher density.
Alcohol and quality are strongly correlated. Wines with an alcohol level between 11 and 13 clearly received the highest quality ranks.
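One way to check which alcohol range goes with the highest ranks is to bin alcohol and average quality within each bin. A minimal sketch on the first ten observations from the `str()` output; the break points are assumptions chosen to fit this small sample, and with the full data set wider breaks (for example, up to 15) would be needed:

```r
# First ten alcohol and quality values, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin alcohol into right-closed intervals: (9,9.5], (9.5,10], (10,10.5].
# These break points are illustrative only.
alcohol_bin <- cut(alcohol, breaks = c(9, 9.5, 10, 10.5))

# Mean quality rank within each alcohol bin.
mean_quality <- tapply(quality, alcohol_bin, mean)
print(mean_quality)
```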
Wines with a citric.acid level between 0.25 and 0.5 received the highest quality ranks. Citric acid adds flavor to wine, which explains the positive correlation.
At too high a level, volatile.acidity can lead to an unpleasant, vinegar-like taste, which explains the negative correlation between volatile.acidity and quality.
Sulphates can contribute to sulfur dioxide, which in turn is used as a preservative and affects the taste of the wine. This explains the positive correlation between sulphates and quality. The highest rank, 8, is associated with a sulphates level around 0.75.
source: https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/
Alcohol and pH are positively correlated because riper wines have higher alcohol content, lower acidity and higher pH. source: https://www.winespectator.com/drvinny/show/id/How-Does-pH-Affect-Alcohol-in-Wine
Acidity, tannin, alcohol and sweetness are the main traits that affect red wine quality. Therefore, I wanted to focus on the pH, alcohol, density and residual.sugar variables to see their effects on the quality variable. However, after calculating the correlations among the variables in the dataset, I found that quality is strongly correlated with alcohol, citric.acid, volatile.acidity and sulphates. I then focused on the relationship of these 4 variables with quality. In summary, I noticed the following:
- Wines with an alcohol level between 12 and 13 received the highest quality ranks.
- Wines with a citric.acid level between 0.25 and 0.5 received the highest quality ranks, since citric acid adds flavor to wine.
- At too high a level, volatile.acidity can lead to an unpleasant, vinegar-like taste, which explains the negative correlation with quality.
- Sulphates can contribute to sulfur dioxide, which in turn is used as a preservative and affects the taste of the wine. This explains the positive correlation between sulphates and quality.

I also explored some relationships among other variables:
1. Between alcohol and residual.sugar: above an alcohol level of 12.5%, residual.sugar decreases, since the sweetness is reduced and the taste becomes dry.
2. Between residual.sugar and density: the two are strongly correlated. The reason might be that sweeter wine has a higher density.
3. Between alcohol and pH: the two are positively correlated because riper wines have higher alcohol content, lower acidity and higher pH.
As shown above, the quality rank increases with alcohol levels up to 13 and citric.acid levels up to 0.375. It seems that considering these two variables together reduces the upper limit for citric.acid from 0.5 to 0.375.
As shown above, the quality rank increases with alcohol levels up to 13 and sulphates levels up to 0.75.
As shown above, the quality rank decreases at high volatile.acidity levels (close to 0.6) and increases with alcohol levels up to 13.
As shown above, the quality rank decreases at high volatile.acidity levels, around 0.6. When citric.acid is plotted separately, levels between 0.25 and 0.5 received the highest quality ranks. However, this trend is not clear in the above plot.
As shown above, the quality rank increases with sulphates levels between 0.5 and 0.75 and citric.acid levels above 0.25 and below 0.75. source: https://ggplot2.tidyverse.org/reference/geom_smooth.html
I investigated the relationships among 3 of the 4 variables that are strongly correlated with quality: volatile.acidity, alcohol, citric.acid and sulphates.
First, I examined alcohol vs. citric.acid, then alcohol vs. sulphates, then alcohol vs. volatile.acidity. I found that high quality is always associated with an alcohol level between 11 and 13, regardless of the sulphates, volatile.acidity and citric.acid levels.
Second, I examined citric.acid vs. alcohol, then citric.acid vs. volatile.acidity, then citric.acid vs. sulphates. I found that the citric.acid range associated with high quality shifts slightly with volatile.acidity, alcohol and sulphates. When citric.acid is plotted alone, levels between 0.25 and 0.5 received the highest quality ranks, while with alcohol the upper limit decreases to 0.375 and with sulphates it increases to 0.75.
Alcohol and quality are strongly correlated. Wines with an alcohol level between 11 and 13 clearly received the highest quality ranks.
As shown above, the quality rank increases with alcohol levels up to 13 and citric.acid levels up to 0.375. It seems that considering these two variables together reduces the upper limit for citric.acid from 0.5 to 0.375.
As shown above, the quality rank decreases at high volatile.acidity levels, around 0.6. When citric.acid is plotted separately, levels between 0.25 and 0.5 received the highest quality ranks. However, this trend is not clear in the above plot.
The red wine quality data set contains 1599 observations and 13 variables. The quality variable is the dependent variable and ranges from 3 to 8. Almost 41% of the wines have a quality rank of 5 and almost 39% have a rank of 6, which might influence the results since the dataset is imbalanced.
Acidity, tannin, alcohol and sweetness are the main traits that affect red wine quality. Therefore, I decided to explore pH, alcohol, density and residual.sugar to see the effects of these variables on quality. However, I was surprised to find that quality is strongly correlated with volatile.acidity and sulphates.
Some of the interesting observations that I found are:
1. High residual.sugar values are unusual in the wine observations.
2. A higher alcohol level makes the wine taste dry, while the taste tends to be sweet for levels under 12.5%.
3. Sweeter wine has a higher density.
4. Wines with an alcohol level between 12 and 13 received the highest quality ranks.
5. Wines with a citric.acid level between 0.25 and 0.5 received the highest quality ranks, because citric acid adds flavor to wine.
6. At too high a level, volatile.acidity can lead to an unpleasant, vinegar-like taste.
7. Sulphates can contribute to sulfur dioxide, which in turn is used as a preservative and affects the taste of the wine.
8. Riper wines have higher alcohol content, lower acidity and higher pH.

In addition to the imbalanced dataset issue mentioned above, multicollinearity exists between some variables, such as:
1. volatile.acidity and sulphates are correlated
2. citric.acid and sulphates are correlated
I plan to build a prediction model to predict the red wine quality rank based on the four variables volatile.acidity, citric.acid, alcohol and sulphates. I will use a classification algorithm, such as a decision tree. I believe that finding the right package and writing the code will be challenging. I might also try logistic regression, but to do so I have to convert quality to a binary variable (high and low).
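The conversion of quality to a binary variable could look like the sketch below. The cut-off (ranks of 6 or above count as "high") is an assumption, since the text does not fix one, and only the first ten ranks from the `str()` output are used:

```r
# Quality ranks of the first ten observations, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut-off: ranks of 6 or above are labelled "high", the rest "low".
quality_class <- factor(ifelse(quality >= 6, "high", "low"),
                        levels = c("low", "high"))
table(quality_class)

# With the full data set in a data frame (here called `wine`, an assumed
# name) that includes quality_class, a logistic regression on the four
# predictors could then be fitted along these lines:
# fit <- glm(quality_class ~ alcohol + volatile.acidity + citric.acid + sulphates,
#            data = wine, family = binomial)
```

Putting "low" first in `levels` makes it the reference class, so `glm()` with `family = binomial` would model the probability of "high".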
References: https://bibinmjose.github.io/RedWineDataAnalysis/ + all sources listed above